ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
Current speaker verification techniques rely on a neural network to extract
speaker representations. The successful x-vector architecture is a Time Delay
Neural Network (TDNN) that applies statistics pooling to project
variable-length utterances into fixed-length speaker characterizing embeddings.
In this paper, we propose multiple enhancements to this architecture based on
recent trends in the related fields of face verification and computer vision.
Firstly, the initial frame layers can be restructured into 1-dimensional
Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we
introduce Squeeze-and-Excitation blocks in these modules to explicitly model
channel interdependencies. The SE block expands the temporal context of the
frame layer by rescaling the channels according to global properties of the
recording. Secondly, neural networks are known to learn hierarchical features,
with each layer operating on a different level of complexity. To leverage this
complementary information, we aggregate and propagate features of different
hierarchical levels. Finally, we improve the statistics pooling module with
channel-dependent frame attention. This enables the network to focus on
different subsets of frames during the estimation of each channel's statistics.
The proposed ECAPA-TDNN architecture significantly outperforms state-of-the-art
TDNN based systems on the VoxCeleb test sets and the 2019 VoxCeleb Speaker
Recognition Challenge.
Comment: proceedings of INTERSPEECH 202
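The channel-dependent attentive statistics pooling described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the attention network (weights `W`, `b`, `v`) and all dimensions are illustrative, but it shows how every channel gets its own softmax over frames before weighted means and standard deviations are concatenated into a fixed-length vector.

```python
import numpy as np

def attentive_stats_pooling(x, W, b, v):
    """Channel-dependent attentive statistics pooling (simplified sketch).

    x: (C, T) frame-level features of one utterance.
    W, b: shared hidden layer of a small attention network; v: (C, H), so
    each channel gets its own attention weights over the T frames.
    Returns a (2C,) vector of attention-weighted means and stds.
    """
    h = np.tanh(W @ x + b[:, None])            # (H, T) shared hidden layer
    e = v @ h                                  # (C, T) channel-dependent scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)          # per-channel softmax over frames
    mu = (a * x).sum(axis=1)                   # weighted mean per channel
    var = (a * x ** 2).sum(axis=1) - mu ** 2   # weighted variance per channel
    sigma = np.sqrt(np.clip(var, 1e-8, None))
    return np.concatenate([mu, sigma])         # fixed-length (2C,) output

rng = np.random.default_rng(0)
C, T, H = 4, 50, 8                             # toy dimensions
x = rng.standard_normal((C, T))
W = rng.standard_normal((H, C)) * 0.1
b = np.zeros(H)
v = rng.standard_normal((C, H)) * 0.1
pooled = attentive_stats_pooling(x, W, b, v)
```

Because the attention weights differ per channel, each channel can emphasize a different subset of frames when its statistics are estimated, which is the core idea of the pooling module.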
Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization
In this paper we describe the top-scoring IDLab submission for the
text-independent task of the Short-duration Speaker Verification (SdSV)
Challenge 2020. The main difficulty of the challenge lies in the large degree
of varying phonetic overlap between the potentially cross-lingual trials, along
with the limited availability of in-domain DeepMine Farsi training data. We
introduce domain-balanced hard prototype mining to fine-tune the
state-of-the-art ECAPA-TDNN x-vector based speaker embedding extractor. The
sample mining technique efficiently exploits speaker distances between the
speaker prototypes of the popular AAM-softmax loss function to construct
challenging training batches that are balanced on the domain-level. To enhance
the scoring of cross-lingual trials, we propose a language-dependent s-norm
score normalization. The imposter cohort contains only data from the Farsi
target domain, which reflects that the enrollment data is always Farsi. If a
Gaussian backend language model detects English in the test speaker embedding,
a cross-language compensation offset determined on the AAM-softmax speaker
prototypes is subtracted from the maximum expected imposter mean score.
A fusion of five systems with minor topological tweaks resulted in a final
MinDCF and EER of 0.065 and 1.45% respectively on the SdSVC evaluation set.
Comment: proceedings of INTERSPEECH 202
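The language-dependent s-norm above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the cohort scores are synthetic, and the compensation offset is a free parameter here rather than the value the authors derive from the AAM-softmax prototypes.

```python
import numpy as np

def language_dependent_s_norm(score, enroll_cohort, test_cohort,
                              test_is_english=False, offset=0.0):
    """Symmetric score normalization (s-norm) sketch.

    In the language-dependent variant, the imposter cohort holds only
    Farsi target-domain data, mirroring Farsi enrollment. When a language
    identifier flags the test segment as English, a compensation offset
    is subtracted from the expected imposter mean on the test side.
    """
    me, se = enroll_cohort.mean(), enroll_cohort.std()
    mt, st = test_cohort.mean(), test_cohort.std()
    if test_is_english:
        mt -= offset                     # cross-language compensation
    return 0.5 * ((score - me) / se + (score - mt) / st)

rng = np.random.default_rng(0)
enroll_cohort = rng.normal(0.1, 0.05, size=200)   # illustrative cohort scores
test_cohort = rng.normal(0.1, 0.05, size=200)
norm_farsi = language_dependent_s_norm(0.6, enroll_cohort, test_cohort)
norm_english = language_dependent_s_norm(0.6, enroll_cohort, test_cohort,
                                         test_is_english=True, offset=0.05)
```

Lowering the imposter mean for detected English test segments raises the normalized score of cross-lingual trials, compensating for the systematic score drop these trials would otherwise suffer.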
Behavioral Analysis of Pathological Speaker Embeddings of Patients During Oncological Treatment of Oral Cancer
In this paper, we analyze the behavior of speaker embeddings of patients
during oral cancer treatment. First, we find that pre- and post-treatment
speaker embeddings differ significantly, indicating a substantial change in
voice characteristics. However, a partial recovery towards pre-operative voice
traits is observed 12 months post-operation. Second, the same-speaker
similarity at distinct treatment stages is similar to healthy speakers,
indicating that the embeddings can capture characterizing features of even
severely impaired speech. Finally, a speaker verification analysis shows a
stable false positive rate and a variable false negative rate when combining
speech samples of different treatment stages. This indicates robustness of the
embeddings towards other speakers, while still capturing the changing voice
characteristics during treatment. To the best of our knowledge, this is the
first analysis of speaker embeddings during oral cancer treatment of patients.
Comment: proceedings of INTERSPEECH 202
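The kind of same-speaker versus different-speaker comparison underlying this analysis reduces to cosine similarity between embeddings. A toy NumPy sketch with synthetic vectors (the drift model and the 192-dimensional size are illustrative, not patient data):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic embeddings for illustration only: a "post-treatment" embedding
# modeled as the pre-treatment one plus drift, and an unrelated speaker.
rng = np.random.default_rng(1)
pre = rng.standard_normal(192)                # 192-d, a common embedding size
post = pre + 0.8 * rng.standard_normal(192)   # same speaker, changed voice
other = rng.standard_normal(192)              # different speaker

same_speaker = cosine_similarity(pre, post)
diff_speaker = cosine_similarity(pre, other)
```

The finding in the abstract corresponds to `same_speaker` dropping after treatment yet remaining clearly above `diff_speaker`, which is why the false positive rate stays stable while the false negative rate varies across treatment stages.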
The IDLab VoxSRC-20 Submission: Large Margin Fine-Tuning and Quality-Aware Score Calibration in DNN Based Speaker Verification
In this paper we propose and analyse a large margin fine-tuning strategy and
a quality-aware score calibration in text-independent speaker verification.
Large margin fine-tuning is a secondary training stage for DNN based speaker
verification systems trained with margin-based loss functions. It enables the
network to create more robust speaker embeddings by enabling the use of longer
training utterances in combination with a more aggressive margin penalty. Score
calibration is a common practice in speaker verification systems to map output
scores to well-calibrated log-likelihood-ratios, which can be converted to
interpretable probabilities. By including quality features in the calibration
system, the decision thresholds of the evaluation metrics become
quality-dependent and more consistent across varying trial conditions. Applying
both enhancements on the ECAPA-TDNN architecture leads to state-of-the-art
results on all publicly available VoxCeleb1 test sets and contributed to our
winning submissions in the supervised verification tracks of the VoxCeleb
Speaker Recognition Challenge 2020.
Comment: proceedings of ICASSP 202
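The margin-based loss that large margin fine-tuning builds on can be sketched with a minimal additive angular margin (AAM) softmax. This NumPy version is illustrative only: the margin values and toy prototypes are assumptions, but it shows the mechanism the fine-tuning stage makes more aggressive.

```python
import numpy as np

def aam_softmax_logits(emb, prototypes, labels, margin=0.2, scale=30.0):
    """Additive Angular Margin (AAM) softmax logits, a minimal sketch.

    Large margin fine-tuning raises `margin` in a secondary training
    stage on longer utterances (the values here are illustrative).
    emb: (N, D) embeddings; prototypes: (K, D) class weights.
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = e @ p.T                                   # (N, K) cosines
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = cos.copy()
    rows = np.arange(len(labels))
    # add the angular margin only to each sample's target class
    logits[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * logits

rng = np.random.default_rng(0)
emb = rng.standard_normal((2, 16))
prototypes = emb + 0.1 * rng.standard_normal((2, 16))  # near-aligned classes
labels = np.array([0, 1])
soft = aam_softmax_logits(emb, prototypes, labels, margin=0.2)
harder = aam_softmax_logits(emb, prototypes, labels, margin=0.5)
```

A larger margin shrinks only the target-class logit, forcing the network to push embeddings closer to their speaker prototype; this is why the fine-tuning stage pairs the stronger penalty with longer, more informative training utterances.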
ECAPA-TDNN Embeddings for Speaker Diarization
Learning robust speaker embeddings is a crucial step in speaker diarization.
Deep neural networks can accurately capture speaker discriminative
characteristics and popular deep embeddings such as x-vectors are nowadays a
fundamental component of modern diarization systems. Recently, some
improvements over the standard TDNN architecture used for x-vectors have been
proposed. The ECAPA-TDNN model, for instance, has shown impressive performance
in the speaker verification domain, thanks to a carefully designed neural
model.
In this work, we extend, for the first time, the use of the ECAPA-TDNN model
to speaker diarization. Moreover, we improve its robustness with a powerful
augmentation scheme that concatenates several contaminated versions of the same
signal within the same training batch. The ECAPA-TDNN model turned out to
provide robust speaker embeddings under both close-talking and distant-talking
conditions. Our results on the popular AMI meeting corpus show that our system
significantly outperforms recently proposed approaches.
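The augmentation scheme of concatenating several contaminated versions of the same signal in one batch can be sketched as below. This is a NumPy illustration under assumptions: the noise sources and SNR level are stand-ins, not the paper's exact recipe.

```python
import numpy as np

def contaminated_batch(signal, noises, snr_db=5.0):
    """Build one training batch from several contaminated copies of the
    same signal (sketch of the within-batch augmentation idea)."""
    batch = [signal]                                   # keep one clean copy
    sig_pow = np.mean(signal ** 2)
    for noise in noises:
        noise = noise[: len(signal)]
        noise_pow = np.mean(noise ** 2) + 1e-12
        # scale the noise so the mixture hits the requested SNR
        gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
        batch.append(signal + gain * noise)
    return np.stack(batch)                             # (1 + len(noises), T)

rng = np.random.default_rng(2)
sig = rng.standard_normal(16000)                       # 1 s at 16 kHz
noises = [rng.standard_normal(16000) for _ in range(3)]
batch = contaminated_batch(sig, noises, snr_db=5.0)
```

Seeing clean and contaminated versions of the same utterance in the same batch pushes the network towards embeddings that are invariant to the contamination, which is what makes the resulting diarization robust in distant-talking conditions.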
Tackling the score shift in cross-lingual speaker verification by exploiting language information
This paper contains a post-challenge performance analysis on cross-lingual speaker verification of the IDLab submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21). We show that current speaker embedding extractors consistently underestimate speaker similarity in within-speaker cross-lingual trials. Consequently, the typical training and scoring protocols do not put enough emphasis on the compensation of intra-speaker language variability. We propose two techniques to increase cross-lingual speaker verification robustness. First, we enhance our previously proposed Large-Margin Fine-Tuning (LM-FT) training stage with a mini-batch sampling strategy which increases the amount of intra-speaker cross-lingual samples within the mini-batch. Second, we incorporate language information in the logistic regression calibration stage. We integrate quality metrics based on soft and hard decisions of a VoxLingua107 language identification model. The proposed techniques result in an 11.7% relative improvement over the baseline model on the VoxSRC-21 test set and contributed to our third place finish in the corresponding challenge.
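The idea of feeding a language-based quality metric into the calibration stage can be shown with a toy linear calibration. All weights below are illustrative, not trained values, and `same_language_prob` stands in for a soft decision of a language identifier such as VoxLingua107.

```python
def calibrated_llr(score, same_language_prob,
                   w_score=2.0, w_lang=1.5, bias=-1.0):
    """Quality-aware linear calibration sketch (weights illustrative).

    A low `same_language_prob` flags a likely cross-lingual trial; the
    language term then raises the calibrated log-likelihood-ratio to
    compensate for the systematically underestimated raw score of
    within-speaker cross-lingual trials.
    """
    lang_mismatch = 1.0 - same_language_prob
    return w_score * score + w_lang * lang_mismatch + bias
```

With this form, the same raw score maps to a higher LLR when the trial is detected as cross-lingual, which effectively makes the decision threshold language-dependent without retraining the embedding extractor.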
Exploiting speaker embeddings for improved microphone clustering and speech separation in ad-hoc microphone arrays
For separating sources captured by ad hoc distributed microphones, a key first step is assigning the microphones to the appropriate source-dominated clusters. The features used for such (blind) clustering are based on a fixed-length embedding of the audio signals in a high-dimensional latent space. In previous work, the embedding was hand-engineered from the Mel frequency cepstral coefficients and their modulation spectra. This paper argues that embedding frameworks designed explicitly for the purpose of reliably discriminating between speakers would produce more appropriate features. We propose features generated by the state-of-the-art ECAPA-TDNN speaker verification model for the clustering. We benchmark these features in terms of the subsequent signal enhancement as well as on the quality of the clustering, for which we additionally introduce three intuitive metrics. Results indicate that in contrast to the hand-engineered features, the ECAPA-TDNN-based features lead to more logical clusters and better performance in the subsequent enhancement stages, thus validating our hypothesis.
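Assigning microphones to source-dominated clusters from their embeddings can be sketched with a tiny k-means. The vectors below are synthetic placeholders; a real system would feed each microphone's signal through ECAPA-TDNN to obtain the embedding. Farthest-point initialization keeps the toy example deterministic.

```python
import numpy as np

def cluster_microphones(embeddings, k=2, iters=20):
    """Tiny k-means over per-microphone speaker embeddings (sketch)."""
    # normalize so Euclidean distance tracks cosine similarity
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # farthest-point initialization: deterministic and spread out
    centers = np.empty((k, x.shape[1]))
    centers[0] = x[0]
    for j in range(1, k):
        d = ((x[:, None] - centers[None, :j]) ** 2).sum(-1).min(axis=1)
        centers[j] = x[d.argmax()]
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                c = x[labels == j].mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return labels

# two well-separated "speakers"; three microphones dominated by each
rng = np.random.default_rng(3)
a, b = rng.standard_normal(64), rng.standard_normal(64)
mics = np.stack([a + 0.1 * rng.standard_normal(64) for _ in range(3)]
                + [b + 0.1 * rng.standard_normal(64) for _ in range(3)])
labels = cluster_microphones(mics, k=2)
```

Because speaker-discriminative embeddings place signals dominated by the same source close together, even this simple clustering recovers the source-dominated microphone groups that downstream separation needs.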
Integrating frequency translational invariance in TDNNs and frequency positional information in 2D ResNets to enhance speaker verification
This paper describes the IDLab submission for the text-independent task of the Short-duration Speaker Verification Challenge 2021 (SdSVC-21). This speaker verification competition focuses on short duration test recordings and cross-lingual trials, along with the constraint of limited availability of in-domain DeepMine Farsi training data. Currently, both Time Delay Neural Networks (TDNNs) and ResNets achieve state-of-the-art results in speaker verification. These architectures are structurally very different and the construction of hybrid networks looks like a promising way forward. We introduce a 2D convolutional stem in a strong ECAPA-TDNN baseline to transfer some of the strong characteristics of a ResNet based model to this hybrid CNN-TDNN architecture. Similarly, we incorporate absolute frequency positional encodings in an SE-ResNet34 architecture. These learnable feature map biases along the frequency axis offer this architecture a straightforward way to exploit frequency positional information. We also propose a frequency-wise variant of Squeeze-Excitation (SE) which better preserves frequency-specific information when rescaling the feature maps. Both modified architectures significantly outperform their corresponding baseline on the SdSVC-21 evaluation data and the original VoxCeleb1 test set. A four system fusion containing the two improved architectures achieved third place in the final SdSVC-21 Task 2 ranking.
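The frequency-wise Squeeze-Excitation variant can be sketched as follows. This NumPy illustration makes assumptions about the bottleneck (weights `w1`, `w2` and their sizes are placeholders), but captures the key difference from standard SE: one learned scale per frequency bin instead of one per channel.

```python
import numpy as np

def frequency_wise_se(feature_map, w1, w2):
    """Frequency-wise Squeeze-Excitation (sketch of the fwSE idea).

    feature_map: (C, F, T) channels x frequency x time.
    Standard SE pools over (F, T) and rescales whole channels; fwSE
    instead produces one sigmoid scale per frequency bin, preserving
    frequency-specific information when rescaling the feature maps.
    """
    z = feature_map.mean(axis=(0, 2))          # (F,) squeeze channels + time
    h = np.maximum(w1 @ z, 0.0)                # bottleneck + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))        # (F,) sigmoid scales
    return feature_map * s[None, :, None]      # rescale each frequency bin

rng = np.random.default_rng(4)
C, F, T = 8, 16, 30                            # toy dimensions
fmap = rng.standard_normal((C, F, T))
w1 = rng.standard_normal((4, F)) * 0.1         # illustrative bottleneck
w2 = rng.standard_normal((F, 4)) * 0.1
out = frequency_wise_se(fmap, w1, w2)
```

Broadcasting the scales along the frequency axis lets the excitation emphasize or suppress individual frequency regions, which is exactly the frequency-specific information a channel-wide SE scale would average away.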